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Fast  Optical  Ray  Tracing  Using  Multiple  DSPs 
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Eugene  Waluschka,  Semion  Kizhner,  Gabriel  Colon,  and  Colleen  Weeks 


Abstract — Optical  ray  tracing  is  a  computationally  intensive 
operation  that  is  central  both  to  the  design  of  optical  systems 
and  to  analyzing  their  performance  once  built.  The  authors  have 
previously  reported  on  the  use  of  parallel  digital  signal  processors 
(DSPs)  to  reduce  the  time  required  to  perform  ray  tracing  in 
analyzing  the  performance  of  the  moderate  resolution  imaging 
spectroradiometer  (MODIS),  which  is  presently  in  orbit  on  multi¬ 
ple  spacecraft.  The  earlier  work  was  incomplete,  providing  only  a 
conservative  estimate  of  the  performance  improvement  that  could 
be  achieved  with  one  to  four  DSPs.  This  paper  reports  on  the 
completed  project  and  extends  the  earlier  work  to  eight  DSPs.  As 
predicted  in  the  earlier  paper,  not  all  rays  make  it  through  the 
entire  optical  system.  Many  are  lost  along  the  way.  This  is  one 
factor  that  led  to  reduced  processing  time.  Another  is  the  use  of 
an  optimizing  compiler.  In  this  paper,  the  authors  present  results 
showing  the  separate  effect  of  each  of  these  two  independent 
factors  on  the  overall  processing  time.  The  most  significant  finding 
is  the  extraordinarily  linear  relationship  between  the  number  of 
DSPs  available  and  the  speed  of  the  ray  tracing.  By  using  eight 
DSPs,  the  processing  time  is  reduced  from  two  weeks  to  less  than 
one  and  a  half  days,  an  improvement  of  nearly  a  whole  order  of 
magnitude.  Low-cost  high-speed  ray  tracing  is  now  feasible  using 
off-the-shelf  plug-in  processor  boards. 

Index  Terms — Digital  signal  processor  (DSP),  moderate  reso¬ 
lution  imaging  spectroradiometer  (MODIS),  optical  ray  tracing, 
optics,  parallel  processing,  reconfigurable  computing,  resistance- 
capacitance  ( RC ). 

I.  Introduction 

RAY-TRACING  simulations  are  an  essential  part  of  the 
design  of  optical  systems  as  well  as  analyzing  their 
performance  after  their  construction  [1],  [2].  Unfortunately, 
the  time  required  for  tracing  substantial  numbers  of  rays  can 
be  staggering.  In  an  earlier  article  [3],  we  reported  on  pre¬ 
liminary  results  entailing  the  use  of  from  one  to  four  digital 
signal  processors  (DSP)  to  perform  such  simulations  for  the 
moderate  resolution  imaging  spectroradiometer  (MODIS):  an 
earth- sensing  instrument  currently  in  orbit  on  NASA’s  Terra 
and  Aqua  satellites  [4] .  We  summarized  the  essentials  of  optical 
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ray  tracing  in  that  article  and  described  the  MODIS  instrument, 
whose  performance  we  simulated.  In  this  paper,  we  explain 
how  we  tested  the  completed  system,  give  measurements  of 
the  time  taken  to  complete  the  simulations,  and  analyze  the 
relationship  between  the  number  of  DSPs  available  and  the 
speed  of  the  system. 

A  characteristic  of  the  MODIS  is  that  a  calibration  is  per¬ 
formed  once  in  each  orbit.  Although  the  MODIS  ordinarily 
takes  images  of  the  earth’s  surface,  during  calibration  it  takes 
an  image  of  the  sun  (not  directly  but  as  projected  onto  a  diffuse 
surface).  This  provides  a  more-or-less  uniform  illumination  of 
all  the  picture  elements  on  the  image  plane’s  detectors.  Because 
the  sun’s  radiance  is  so  great,  its  light  is  attenuated  before 
being  projected  onto  the  diffuser.  Instead  of  using  neutral- 
density  filters,  MODIS  uses  a  screen  perforated  with  numerous 
pinholes.  It  reduces  the  irradiance  by  91.5%.  Some  aspects  of 
the  diffuser  are  discussed  in  Waluschka  et  al.  [5]. 

After  the  plane  of  the  entrance  aperture,  the  attenuation 
screen,  and  the  diffuser,  there  follow  a  further  25  optical  sur¬ 
faces  starting  with  a  mirror  and  ending  with  an  image  plane 
containing  an  array  of  optical  sensors. 

II.  Previous  Results 

Generally  speaking,  only  a  subset  of  the  optical  rays  im¬ 
pinging  on  an  optical  system  ever  reaches  the  image  plane. 
Many  rays  are  lost  along  the  way  because  they  pass  outside 
one  or  another  of  the  apertures  encountered  within  the  optical 
instrument.  In  our  earlier  work  [3],  we  were  very  conservative 
in  estimating  the  time  required  to  complete  a  simulation.  We 
had  not  then  completed  the  ray-tracing  program:  It  did  not  yet 
trace  all  the  rays  of  interest.  We  instead  repeatedly  traced  a 
single  ray,  one  known  to  reach  the  image  plane,  as  it  proceeded 
from  the  diffuser  to  the  image  plane.1 

We  measured  the  time  it  took  to  trace  349  932  instances 
of  this  single  ray  and  extrapolated  from  those  measurements 
the  time  it  would  take  to  trace  the  6.24  x  109  rays  of  a  com¬ 
plete  simulation.  These  results  appear  in  Table  I.  A  key  result 
is  the  number  of  rays  traced  each  second  for  each  parallel 
processor  available:  2848  rays  •  (s  •  processor)-1.  An  earlier 
simulation  done  using  a  program  written  in  FORTRAN  and  ex¬ 
ecuted  on  a  single  Digital  Equipment  Corporation  (DEC)  Alpha 
3000  series  model  800  computer  achieved  a  rate  of  roughly 
5160  rays  •  s-1. 

!More  recently,  once  we  had  completed  the  program,  we  confirmed  our 
earlier  certainty  that  many  simulated  rays  never  reached  the  image  plane.  Every 
such  abandoned  ray  reduced  the  overall  simulation  time,  giving  apparently 
better  performance  than  our  earlier  estimates  suggested  we  should  expect. 
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TABLE  I 

Previously  Reported  Time  to  Trace  349932  Rays 


Number 

of 

Processors 

Elapsed 

Time 

/  s 

Time  per 
Ray 

/|OS 

Expected 
Completion 
Time  /  day 

Rays 

Traced 

/  s 

1 

122.861 

351.1 

25.4 

2  848 

2 

61.449 

175.6 

12.7 

5  695 

3 

40.942 

117.0 

8.4 

8  547 

4 

30.721 

87.8 

6.3 

11389 

Rate  of  ray  tracing:  2  848  rays  traced  •  s  1  •  processor  1 . 


Detector 


Fig.  1.  MODIS  optical  system. 

III.  Experimental  Setup 

A.  Implementation 

We  implemented  the  system  on  a  Dell  Precision  Workstation 
530  MT  based  on  an  Intel  Xeon  CPU  operating  at  a  frequency 
of  1.70  GHz.  This  system  used  a  400-MHz  system  bus,  an 
8-kB  LI  cache,  and  a  256-MB  L2  cache.  It  also  employed 
two  Bittware  Hammerhead  66-MHz  PCI  boards,  each  con¬ 
taining  four  Analog  Devices  ADSP21160  DSPs  operating  at 
80  MHz.2  Our  approach  was  to  use  the  PC  to  direct  and 
coordinate  the  operation  of  the  eight  DSPs.3  The  PC  could  have 
been  used  to  perform  some  of  the  ray  tracing  from  the  diffuser 
to  the  image  plane.  This  was  not  done,  however,  because  we 
wanted  to  make  it  easy  to  determine  the  relationship  between 
the  number  of  DSPs  available  and  the  number  of  rays  traced 
each  second.  In  practice,  it  would  be  desirable  to  let  the  PC 
perform  ray  tracing  whenever  it  was  otherwise  idle. 

B.  Simulation 

The  MODIS  optical  system  is  shown  schematically  in  Fig.  1. 
The  attenuator  screen  has  1951  pinholes.  We  traced  rays  from 
485  of  these,  the  same  ones  used  in  the  original  FORTRAN 
simulation  program.  From  the  center  of  each  pinhole,  we  simu¬ 
lated  launching  a  rectangular  array  of  21  rows  x  21  rays/row  = 
441  rays.  The  central  ray  of  this  grouping  was  based  on  the 
direction  of  the  sun  from  the  entrance  pupil,  and  the  transverse 
extent  of  the  grouping  was  based  on  the  sun’s  angular  extent 
(semidiameter),  roughly  0.25°. 

We  used  the  host  PC  to  trace  each  ray  from  a  pinhole  to  the 
diffuser.  Some  of  the  rays  that  departed  the  pinholes  missed  the 
diffuser;  these  were  discarded.  Of  those  that  struck  the  diffuser, 
each  ray  was  then  used  to  generate  a  second  rectangular  array  of 


A:  Previous  results  ready.  DSP  ready  for  new  data. 
B:  PC  accepts  results,  assigns  new  task. 

C:  DSP  waiting  for  release. 

D:  PC  releases  DSP. 


Fig.  2.  Synchronization  of  the  PC  and  one  DSP.  This  is  a  full  handshaking 
scheme  using  polling.  The  same  arrangement  exists  for  each  DSP  in  the  system. 


241  rows  x  121  rays/row  =  29  161  rays.  This  collection  of  rays 
was  traced  by  an  assigned  DSP,  which  discarded  any  rays  that 
did  not  reach  the  image  plane.  Of  the  rays  that  did  reach  it,  the 
DSP  tallied  the  number  of  rays  that  struck  each  cell  in  the  image 
plane.  These  cells  were  arranged  in  12  rows  x  10  cells/row  = 
120  cells.  Upon  completing  its  assigned  task,  the  DSP  notified 
the  PC  that  it  was  ready  for  a  new  assignment. 

When  the  PC  discovered  a  waiting  DSP,  it  retrieved  its 
120-element  tally  and  gave  it  a  new  point  on  the  diffuser  from 
which  to  trace  a  further  29  161  rays.  The  PC  could  have  accu¬ 
mulated  these  results.  In  fact,  we  chose  not  to  do  this,  instead 
saving  a  subset  of  the  results  for  inspection  and  verification. 
Comparison  of  numerous  120-element  tallies  with  those  from 
the  earlier  FORTRAN  program  showed  no  difference  at  all, 
even  though  the  FORTRAN  program  used  double-precision 
floating-point  arithmetic,  whereas  the  PC  and  the  DSPs  only 
used  single-precision  floating-point  arithmetic. 

To  synchronize  the  PC  and  the  DSP,  we  took  advantage  of 
the  fact  that  the  Hammerhead  board  allows  two-port  access  to 
the  memory  of  the  DSP.  We  implemented  a  full  handshaking 
scheme,  as  illustrated  in  Fig.  2.  To  do  this,  we  set  up  a  pair  of 
mailboxes  in  each  DSP’s  memory,  one  for  the  DSP  and  one  for 
the  PC.  When  the  DSP  was  ready  for  the  PC  to  service  it,  the 
DSP  set  its  flag  (point  A  in  the  figure).  The  PC  polled  this  flag 
from  time  to  time  and  set  its  own  flag  once  it  discovered  the 
DSP  had  set  its  flag  (B).  At  this  point  the  DSP  reset  its  flag  (C). 
The  PC  now  retrieved  the  results  of  the  DSP’s  latest  ray-tracing 
simulation  and  gave  the  DSP  the  new  coordinates  in  the  diffuser 
plane  corresponding  to  its  next  assignment.  Then,  the  PC  reset 
its  flag,  leaving  the  DSP  to  start  its  next  task  (D). 

We  performed  eight  complete  simulations.  Each  simulation 
entailed  tracing  all 


485  pinholes  x 


441  pinhole  rays 
pinhole 


x 


29  161  diffuser  rays 
pinhole  ray 


=  62371 00  48 5  diffuser  rays 


and  each  was  performed  using  a  different  number  of  DSPs, 
from  one  to  eight.  The  measured  simulation  times  and  corre¬ 
sponding  rates  of  tracing  rays  are  shown  in  Table  II.  Two  factors 
are  responsible  for  the  increase  in  the  ray-tracing  rate  from  that 
shown  in  Table  I. 


2In  our  earlier  work  [3],  we  used  only  a  single  Hammerhead  board. 

3  The  programs  running  in  the  PC  as  well  as  all  eight  DSPs  were  written  in 
the  C  programming  language. 


1)  The  completed  program  was  compiled  with  optimization 
enabled,  leading  to  a  more  efficient  code  in  the  PC  and, 
more  importantly,  in  each  DSP. 
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TABLE  II 

Running  the  Simulation  on  Different  Numbers  of  Processors 


#  of 

Processors 

Available 

Simulation 

Time  / 
ks 

Rate  of 

Ray  Tracing  / 
krays • s-1 

1 

1023.877  625 

6.091646  44 

2 

511.936109 

12.183  357  21 

3 

341.290  782 

18.275  033  53 

4 

255.967  797 

24.366  738  93 

5 

204.774  219 

30.458  426  43 

6 

170.645  281 

36.550  090  62 

7 

146.267812 

42.641647  53 

8 

127.984  234 

48.733  35012 

#  of  rays  traced:  6.24  x  109  rays 

Rate  of  ray  tracing:  6  092  rays  traced  •  s-1  •  processor-1 


Fig.  3.  Comparison  of  ray-tracing  rates. 

2)  Many  rays  in  the  complete  simulation  never  reached 
the  image  plane.  Being  abandoned  part  way  through  the 
entire  system,  they  took  less  time  to  trace. 

IV.  Comparison 

Fig.  3  provides  plots  of  three  different  measurements. 

1)  Rate  of  ray  tracing  achieved  by  the  FORTRAN  program 
executing  on  a  DEC  Alpha  computer.  This  program  could 
only  be  executed  on  a  single  processor.  The  rate  is  marked 
by  a  black  dot  (•). 

2)  Rates  of  ray  tracing  estimated  from  partial  simulations 
with  nonoptimized  code.  These  are  from  the  data  pre¬ 
sented  in  Table  I.  The  rates  are  marked  by  black  squares 
(■)• 

3)  Rates  of  ray  tracing  estimated  from  full  simulations  with 
optimized  code.  These  are  from  the  data  presented  in 
Table  II.  The  rates  are  marked  by  black  diamonds  (♦). 

Additionally,  the  plot  shows  least  square  linear  fits  for 
the  data  from  the  two  C  programs.  Equations  for  these  two 
lines  are 

rate  of  tracing  rays  under  full  simulation 

=  ((6  091.6695  ±  0.0074)n  +  0.024  ±  0.038)  rays  •  s”1  (1) 
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TABLE  III 
Residual  Analysis 


Observed  Predicted 

Rate  /  Rate  /  Residual  f  Residual2  / 


N 

krays 

•  s 

l 

krays 

•  s 

l 

rays  • 

s  1 

rays^ 

-s' 

2 

1 

6.091 

646 

43 

6.091 

693 

17 

-0.046 

743 

271 

0.002 

184 

933 

2 

12.183 

357 

21 

12.183 

362 

66 

-0.005 

446 

525 

0.000 

029 

665 

3 

18.275 

033 

53 

18.275 

032 

14 

0.001 

395 

983 

0.000 

001 

949 

4 

24.366 

738 

93 

24.366 

701 

61 

0.037 

320 

391 

0.001 

392 

812 

5 

30.458 

426 

43 

30.458 

371 

09 

0.055 

342 

298 

0.003 

062 

770 

6 

36.550 

090 

62 

36.550 

040 

57 

0.050 

053 

448 

0.002 

505 

348 

7 

42.641 

647 

53 

42.641 

710 

05 

-0.062 

513 

228 

0.003 

907 

904 

8 

48.733 

350 

12 

48.733 

379 

53 

-0.029 

409 

097 

0.000 

864 

895 

Sum  0.013  950  275 


N  =  the  number  of  processors  available 
VSurn 

Average  residual  =  -  =  0.014763  91  rays  •  s 


and 

rate  of  tracing  rays  under  partial  simulation 

=  ((2  848.0  ±  l.l)n  +  0.2  ±  3.1)  rays  •  s”1  (2) 

where  n  is  the  number  of  processors  assigned. 

The  stated  uncertainties  are  ±1  standard  deviation.  We 
believe  that  the  uncertainties  are  smaller  in  (1)  than  in  (2) 
because  (1)  stemmed  from  tracing  6.24  x  109  rays  while  (2) 
stemmed  from  tracing  only  3.5  x  105  rays.  It  seems  unlikely 
that  switching  from  nonoptimized  to  optimized  compilation 
would  significantly  affect  the  uncertainties. 

The  fact  that  the  standard  deviations  associated  with  the  coef¬ 
ficients  in  these  equations  are  so  small  shows  that  there  is  a  very 
highly  linear  relationship  between  the  number  of  processors 
assigned  and  the  rate  at  which  the  work  can  be  performed.  In 
turn,  this  implies  that  the  administrative  overhead  associated 
with  managing  up  to  eight  processors  is  negligible.  As  a  result, 
we  should  be  able  to  add  many  more  processors  to  the  system 
before  the  administrative  overhead  becomes  significant. 

Another  approach  to  gauging  the  suitability  of  a  linear  least 
squares  fit  is  to  compare  the  observed  measurements  to  the 
values  predicted  by  the  fit.  One  way  to  do  this  is  to  take  their 
differences,  square  them,  add  them  up,  take  the  square  root 
of  this  sum,  and  divide  by  the  number  of  terms  in  the  sum. 
Doing  this  yields  0.01476391  rays  •  s-1,  as  shown  in  Table  III. 
Comparing  this  to  any  of  the  tabulated  observations  makes  it 
clear  that  this  is  a  negligible  quantity  and  supports  our  view 
that  the  least  squares  fit  is  suitable. 

The  effect  of  the  master  processor  (the  Dell  computer)  on  the 
rate  of  ray  tracing  has  been  deliberately  ignored.  The  master 
processor  traces  29  161  rays  from  the  pinhole  attenuator  to 
the  diffuser.  Compared  to  the  6237  100485  rays  traced  by  the 
DSPs  from  the  diffuser  to  the  image  plane,  the  time  taken  to 
trace  these  early  rays  is  negligible. 

V.  Effects  of  Optimization 

The  graph  in  Fig.  3  shows  a  marked  improvement  in 
the  rate  of  ray  tracing  from  2848  rays  •  (s  -  processor)-1  to 
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TABLE  IV 

Comparison  of  Nonoptimized  and  Optimized  Versions  of  the 
Program  With  Four  Processors  Available 


Simulation 

Rate  of 

Program 

Time  / 

Ray  Tracing  / 

Version 

ks 

krays • s-1 

Optimized 

255.967  797 

24.366  738  93 

Non-Optimized 

439.985  516 

14.182  285  04 

9)  T2?p  =  30.721  s  ±  290  /is  is  the  time  to  execute  the  par¬ 
tial  simulation  using  the  nonoptimized  program  with  four 
assigned  processors,  assuming  the  same  clock  accuracy. 

Equations  (1)  and  (2)  show  that  the  rate  of  ray  tracing  is  very 
nearly  proportional  to  the  number  of  processors  assigned.  Our 
objective  is  to  find  p  and  rq  given  all  the  other  quantities.  We 
can  do  this  by  first  writing  down  several  relationships: 


Speed-up  due  to  optimization  =  71.8% 


6092  rays  •  (s  •  processor)-1.  As  remarked  above,  this  improve¬ 
ment  can  be  attributed  in  part  to  the  fact  that  not  all  rays  traverse 
every  optical  surface  within  the  instrument,  and  in  part  to  our 
having  used  an  optimizing  compiler  for  the  faster  program. 

In  order  to  distinguish  between  these  two  effects,  we  per¬ 
formed  a  full  simulation  using  four  DSPs  with  a  nonoptimized 
version  of  the  compiled  program  and  compared  the  time  that 
version  took  to  the  time  required  by  the  optimized  version  of 
the  program  using  the  same  number  of  DSPs. 

To  analyze  the  results,  we  define  the  following  quantities. 

1)  n,  as  before,  is  the  number  of  processors  assigned. 

2)  N  =  6  237 100  485  is  the  number  of  rays  traced  in  a  full 
simulation. 

3)  Np  =  349  932  is  the  number  of  rays  traced  in  a  partial 
simulation. 

4)  p  <  1  is  the  average  fraction  of  all  rays  that  make  it  to  the 
image  plane.  Tracing  pN  rays  to  the  image  plane  would 
take  the  same  time  as  tracing  all  N  rays  and  discarding 
a  ray  as  soon  as  it  is  discovered  not  to  reach  the  image 
plane.  At  this  point,  we  still  consider  it  an  unknown,  but 
we  want  to  determine  its  value. 

5)  T\  is  the  rate  of  tracing  those  rays  that  do  reach  the  image 
plane  using  the  optimized  program,  and  it  is  measured  in 
rays  •  (s  •  processor)-1.  At  this  point,  we  still  consider  it 
an  unknown,  but  we  want  to  determine  its  value. 

6)  r 2,  likewise,  is  the  rate  of  tracing  those  rays  that  do 
reach  the  image  plane  using  the  nonoptimized  program, 
also  measured  in  rays  •  (s  •  processor)-1.  This  is  given  in 
(2)  as  v 2  =  (2848.0  ±  1.1)  rays  •  (s  •  processor)-1,  if  we 
ignore  the  small  fixed  offset. 

7)  tijF  =  255  967.797  s  ±  290  /is  is  the  time  to  execute 
the  full  simulation  using  the  optimized  program  with  four 
assigned  processors,  assuming  the  timing  measurements 
are  accurate  to  within  ±0.5  ms  over  the  full  duration 
of  the  measurement.  The  clock  routines  provided  in  the 
C  runtime  library  offer  the  current  time  to  the  nearest 
1.0  ms.  If  we  assume  the  round-off  error  is  uniformly 
distributed  in  the  range  [—0.5  ms,  0.5  ms],  then  the 
standard  deviation  of  the  error  is  290  /is  [6]  .4 

8)  T2,f  =  439  985.516  s  ±  290  /is  is  the  time  to  execute  the 
full  simulation  using  the  nonoptimized  program  with  four 
assigned  processors,  assuming  the  same  clock  accuracy 
(Table  IV). 


pN 


pN 


NP 


n,F  =  - -  T2,F  =  T -  r2,P  =  - - .  (3) 


4ri 


4r2 


4r2 


The  4  in  each  denominator  is  due  to  our  having  used  n  = 
4  processors  when  measuring  n5 f,  72,f,  and  T2,p. 

Dividing  ti?f  by  T25f>  we  have 


Ti5F  _  pN  4r2  _  r2 
72, f  4rq  pN  7*1 


therefore 

ri  =  r2 =  ((2848.0  ±  1.1)  rays  •  (s  •  processor)-1) 
ti,f 

/ 439 985.516  s  ±  290  /is\ 

X  V  255  967.797s  ±290  ps  )  ' 

Optimized  ray-tracing  rate 

7*1  =  (4895.5  ±  1.9)  rays  •  (s  •  processor)-1.  (5) 

This  is  the  rate  of  ray  tracing  with  an  optimized  program 
when  every  ray  is  traced  all  the  way  to  the  image  plane.5  A 
fair  comparison  with  other  systems  that  do  ray  tracing  can  be 
obtained  by  multiplying  this  by  the  number  of  surfaces  traced 
within  the  DSPs,  which  is  25: 

ray-tracing  rate  for  fair  comparison 

=  (122  386  ±  47)  rays  •  surfaces  •  (s  •  processor)-1.  (6) 

In  our  experiments,  then,  using  eight  processors,  we  achieved 


Ray-tracing  rate  for  fair  comparison  with  eight  processors 

=  (979  090  ±  380)  rays  •  surfaces  •  s-1.  (7) 

Now,  we  turn  to  the  determination  of  the  average  fraction  p 
of  rays  that  make  it  to  the  image  plane  in  our  system.  Dividing 
t2, f  by  t2?p,  we  get 


72, f  pN  4r2  pN 
72, P  4t"2  Np  Np 


4We  use  this  method  of  estimating  the  standard  deviation  of  the  error  in  all 
clock  measurements  throughout  this  paper.  The  method  neglects  any  error  due 

to  clock  drift.  5 Methods  of  estimating  error  propagation  are  given  by  Bevington  [7]. 
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Solving  for  p ,  we  get 

_  Np  T25F 
P  N  T2,p 

(  349932  \  / 439 985.516  s  d=  290  ps\ 

~  v 6  237 100  485  A  30.721  s  ±  290  ps  )  ' 

Effective  fraction  of  rays  reaching  the  image  plane 

p  =  0.8035334  ±  0.0000076.  (8) 

Therefore,  in  our  full  simulation  of  the  MODIS  instrument,  we 
traced  N  rays,  discarding  any  that  went  astray;  the  time  taken 
was  roughly  equivalent  to  that  needed  to  trace  80%  N  rays  all 
the  way  to  the  image  plane.  We  can  use  this  result  to  adjust  (1), 
again  ignoring  the  small  fixed  offset: 

rate  of  tracing  rays  under  full  simulation 
=  (6  091.669  5  ±0.0074) 

x  (0.8035334  ±  0.0000076)  rays  •  (s  •  processor)-1 
=  (4  894.860  ±  0.046)  rays  •  (s  •  processor)-1.  (9) 

Note  how  consistent  this  is  with  the  value  of  rq  already  given  in 
(5).  The  reason  for  the  difference  is  that  the  value  in  (5)  was 
calculated  from  r^p  and  72,f>  whereas  the  value  in  (9)  was 
calculated  from  T2,f  and  T2;p. 

We  could  also  calculate  p  in  another  way  using  (3)  and  (5): 

4ri7i f 

P=  N 

_  4  ((4895.5  ±  1.9)  rays  •  s”1)  (255  967.797  s  ±  290  /ns) 

~  6  237 100  485  rays 

p  =  0.80363  ±0.00031.  (10) 

Comparing  this  to  (8),  we  see  that  both  methods  yield  values 
of  p  that  are  very  close  to  one  another.  The  uncertainty  from 
the  second  method  of  estimating  p  is  larger  because  it  used  the 
shorter  execution  time  associated  with  a  partial  simulation,  with 
its  correspondingly  larger  relative  error.  The  close  agreement 
shows  persuasively  that  our  simulation  was  equivalent  to  trac¬ 
ing  just  80%  of  all  the  rays  we  actually  traced  but  following  this 
lesser  number  of  rays  all  the  way  to  the  image  plane. 

We  can  calculate  the  (fractional)  speedup  in  the  program  due 
to  optimization  by  using  a  formula  widely  used  in  computer 
architecture  [8].  Here,  we  compare  the  tracing  rate  of  the 
optimized  program  to  that  of  the  nonoptimized  program: 

,  ri  -  r2 
speedup  = - 

T2 

(4895.5  ±  1.9)  rays  •  (s  •  processor)-1 
(2848.0  ±  1.1)  rays  •  (s  •  processor)-1 

speedup  from  optimization  =  (71.891  ±  0.094)%.  (11) 

We  can  also  calculate  the  apparent  speedup  due  to  the  fact 
that,  in  the  full  simulation,  not  all  rays  reach  the  image  plane. 
Here,  we  compare  the  apparent  rate  of  ray  tracing  using  the 


optimized  program  to  the  rate  that  would  apply  if  all  traced  rays 
indeed  reached  the  image  plane: 

—  —  T‘i 

speedup  =  — - 

r  i 

1 

=  --l 

P 

=  0.8035334  ±  0.0000076  _L 
Apparent  speedup  due  to  rays  going  astray 

=  (24.45033  ±  0.00034)%.  (12) 

The  speedup  given  in  (11)  is  real  and  is  due  to  using  an 
optimized  program.  That  given  in  (12)  is  illusory  caused  by  a 
simulation  that  traces  a  large  number  of  rays  that  do  not  actu¬ 
ally  reach  the  image  plane.  Quoting  this  artificial  percentage 
speedup  could  be  highly  misleading  since  it  depends  on  the 
choice  of  traced  rays  and  not  on  the  processor  speed  or  the  ray¬ 
tracing  program  itself. 

The  speedup  in  (11)  applies  if  a  single  processor  is  used. 
We  can  finally  calculate  the  speedup  due  to  using  n  processors 
running  an  optimized  program  rather  than  a  single  one  running 
a  nonoptimized  program: 

nr i  -  r2  nrx  , 

speedup  = - = - 1 

r2  r2 

n(4895.5  ±  1.9)  rays  •  (s  •  processor)-1 
(2848.0  ±  1.1)  rays  •  (s  •  processor)-1 

speedup  using  n  processors  with  optimized  code 

=  (1.71891  ±  0.00094)n  -  1.  (13) 

Although  we  used  DSPs  for  our  slave  processors,  these 
are  not  ideally  suited  to  the  ray-tracing  problem  because  ray 
tracing  requires  extensive  use  of  division  and  square  roots.6 
The  Analog  Devices  ADSP21160  DSPs  we  used  do  not  fully 
support  either  of  these  floating-point  operations  in  hardware. 
Rather,  software  is  used  to  perform  them  with  an  attendant 
penalty  in  performance.  It  would  be  far  more  efficient  to  use 
slave  processors  with  hardware  that  fully  supports  floating¬ 
point  arithmetic  such  as  that  defined  by  ANSI/IEEE  Std 
754-1985,  which  is  the  IEEE  standard  for  binary  floating-point 
arithmetic.  We  used  ADSP21160  DSPs  for  two  primary  rea¬ 
sons:  We  had  them  on  hand  and  they  provide  a  simple  interface 
to  a  standard  host  personal  computer. 

The  original  FORTRAN  program  executing  on  a  single  DEC 
Alpha  3000  series  model  800  computer  took  two  weeks  to 
perform  a  full  simulation.  We  reduced  this  to  35.55  h  using  two 
Hammerhead  boards  with  eight  ADSP21 160  DSPs,  which  is  an 
improvement  of  nearly  one  order  of  magnitude. 

6  As  explained  in  our  earlier  paper  [3],  the  need  for  sine  and  cosine  calcu¬ 
lations  can  be  confined  to  the  initialization  phase  of  the  program’s  execution, 
where  the  functions  that  require  them  are  computed  exactly  once.  During  the 
repetitive  phase  that  calculates  ray  trajectories,  trigonometric  functions  are 
calculated  without  using  expensive  (that  is,  time  consuming)  sine  or  cosine 
functions. 
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Comparing  the  speeds  of  different  processors  is  often  done 
using  the  published  benchmarks.  Getting  a  comparison  between 
a  general-purpose  microprocessor  such  as  the  DEC  Alpha  3000 
and  a  digital  signal  processor  such  as  the  ADSP21160  is  diffi¬ 
cult  because  they  address  different  markets  and  so  are  evaluated 
using  different  benchmarks.  A  rough  comparison  can  be  made 
by  linking  together  the  results  of  several  unrelated  benchmarks 
and  treating  the  processors  compared  as  being  fairly  repre¬ 
sented  by  all  of  them.  A  further  assumption  in  this  method  is 
that  benchmark  scores  are  linearly  related  to  performance. 

Longbottom  [9]  gives  the  results  of  Whetstone  benchmarks 
showing  single-precision  and  double-precision  floating-point 
arithmetic  performance  of  various  processors.  The  Intel  Pen¬ 
tium  III  with  a  clock  frequency  of  1 .4  GHz  received  scores  of 
972  and  1006,  with  an  average  of  989.  The  DEC  Alpha  series 
3000  Model  800  with  a  clock  frequency  of  200  MHz  received 
scores  of  155  and  129,  with  an  average  of  142.  A  comparison 
of  these  two  averages  permits  us  to  estimate  that  the  Pentium 
III  is  7.0  times  faster  than  the  Alpha. 

Berkeley  Design  Technology,  Inc.  (BDTI)  has  developed  a 
benchmark  for  signal-processing  speeds  [10]  with  benchmark 
results  published  on  the  Internet  [11].  The  Analog  Devices 
ADSP-2126x  with  a  clock  frequency  of  200  MHz  received  a 
score  of  1090.  The  Intel  Pentium  III  with  a  clock  frequency  of 
1.4  GHz  received  a  score  of  3130.  A  comparison  of  these  two 
scores  permits  us  to  estimate  that  the  Pentium  III  is  2.8  times 
faster  than  the  2126x. 

Analog  Devices  publishes  comparisons  of  their  32-bit  DSP 
microprocessors  on  the  Internet  [12].  The  Analog  Devices 
ADSP-21262  with  a  clock  frequency  of  200  MHz  can  ex¬ 
ecute  up  to  1.2  x  109  floating-point  instructions  per  second 
(1.2  GFLOPS).  The  Analog  Devices  ADSP-21160  with  a  clock 
frequency  of  80  MHz  can  execute  up  to  480  x  106  floating¬ 
point  instructions  per  second  (480  MFLOPS).  A  comparison  of 
these  two  scores  permits  us  to  estimate  that  the  ADSP-21262 
is  2.5  times  faster  than  the  ADSP-21160.  Combining  these 
factors,  we  have 


speedADgP_21 16Q 
speedAlpha 

speedADgP_21 16Q 
speedADgP_21262 

speedADgP_21262  \ 
speedPentium  m  ) 


m  0.97  »  1. 


f  speedPentium  m  \ 
V  speedAlpha  ) 


Such  a  comparison  is  crude,  at  best,  and  neglects  the  impor¬ 
tant  fact  that  Alpha  has  floating-point  hardware  to  support 
division  [13],  whereas  the  ADSP-21160  does  not.  Even  so,  it 
suggests  a  roughly  comparable  level  of  performance  between 
the  two  processors.  Note,  though,  that  the  ADSP-21160  can 
only  achieve  the  rated  speed  when  processing  two  sets  of 
data  with  the  same  instruction  simultaneously.  We  did  not  use 


this  mode,  so  the  performance  should  only  be  half  that  of  an 
Alpha  3000. 

In  fact,  the  optimized  program  with  one  ADSP-21160 
assigned  took  1023.877625  ks,  as  shown  in  Table  II,  or 
11.9  days,  compared  with  roughly  14  days  on  the  Alpha  3000 
series  model  800.  The  DSP  processor  appears  to  be  a  little  more 
efficient,  therefore,  even  though  it  was  not  running  two  sets  of 
data  simultaneously.  This  can  more  likely  be  ascribed  to  the 
differences  in  the  program  than  in  the  underlying  hardware.  A 
further  improvement  in  the  software  to  take  advantage  of  the 
DSP’s  ability  to  process  two  sets  of  data  at  a  time  might  yield  a 
significant  improvement  in  the  ray-tracing  rate. 

Bittware  now  offers  an  improved  product,  the  Danube  6-PaC 
PCI  (D6PC)  board  with  six  Analog  Devices  ADSP-TS201S 
TigerSHARC  DSPs.  These  operate  at  300  MHz,  which  is  a 
factor  of  3.75  higher  than  that  of  the  Bittware  Hammerhead 
boards  we  used,  operating  at  just  80  MHz.  If  we  assume  that 
the  increased  clock  rate  corresponds  to  an  increased  perfor¬ 
mance  rate,  then,  we  can  multiply  this  factor  by  the  value  of 
122  388  rays  •  surfaces  •  (s  •  processor)-1  given  in  (6)  to  get  an 
estimate  of  the  performance  that  a  system  using  Danube  boards 
should  be  able  to  obtain: 

3.75  x  122  388  rays  •  surfaces  •  (s  •  processor)  1 

«  459  000  rays  •  surfaces  •  (s  •  processor)-1. 

According  to  Abbott  [14],  using  a  typical  PCI  bus,  we 
can  connect  six  boards  directly  to  a  motherboard.  We  could 
therefore  envisage  using  six  Danube  boards  with  a  total  of 
36  processors.  This  would  yield 

36  processors  x  459  000  rays  •  surfaces  •  (s  •  processor)-1 

=  16.5  x  106  rays  •  surfaces  •  s-1. 

We  can  also  consider  the  effect  using  six  Danube  boards 
would  have  on  the  observed  apparent  ray-tracing  rate  in  (1), 
using  the  fastest  apparent  rate  of  ray  tracing  observed  in  our 
experiments,  as  shown  in  Table  II 

36  o  i 

48  733  rays  •  s  1  x  3.75  x  — -  =  822  x  103  rays  •  s  1 

8 


and  calculate  from  it  the  amount  of  time  we  could  expect  it 
to  take  to  perform  a  full  simulation  of  the  MODIS  instrument 
using  six  Danube  boards  with  a  total  of  36  processors: 


Total  time  = 


6  237100485  rays 
822  x  103  rays  •  s-1 


«2.1h. 


The  difference  would  represent  a  16.9-fold  increase  in  the  rate, 
corresponding  to  a  speedup  of  15.9,  or  1590%,  and  would 
amount  to  2.2  orders  of  magnitude  less  than  the  time  required 
by  the  original  FORTRAN  program  on  a  single  DEC  Alpha 
computer. 
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VI.  Conclusion 

The  number  r\  in  (5)  is  a  parameter  of  the  ray-tracing  system, 
not  of  the  optical  system.  It  gives  the  rate  of  tracing  rays 
through  a  single  surface  of  an  optical  system.  In  our  analysis, 
however,  we  have  ignored  the  fact  that  the  time  to  calculate 
the  interaction  of  light  rays  with  planar  surfaces,  ellipsoidal 
surfaces,  and  nonellipsoidal  curved  surfaces  varies  substan¬ 
tially.  We  have  instead  tacitly  assumed  that  the  mix  of  such 
surfaces  in  the  MODIS  instrument  is  representative  of  most 
instruments.  A  more  careful  analysis  would  report  different  val¬ 
ues  of  r\  for  each  kind  of  surface.  In  comparing  two  candidate 
ray-tracing  systems,  the  parameter  r\  is  an  important  one  to 
consider. 

The  number  p  in  (8),  on  the  other  hand,  is  a  parameter  of  the 
optical  system  and  of  the  choice  of  rays  used  in  the  simulation, 
not  of  the  ray-tracing  system  per  se.  To  the  extent  that  p  is 
lower  than  one,  the  simulations  entail  tracing  some  rays  part 
way  through  the  optical  system  before  discarding  them.  These 
rays  are  ill  chosen,  insofar  as  it  is  undesirable  to  spend  any  time 
tracing  rays  that  get  discarded.  Ideally,  we  would  choose  no 
such  rays  to  simulate. 

We  could  cause  p  to  be  identically  one  by,  say,  tracing  a 
single  ray  known  to  traverse  the  whole  system  successfully,  as 
we  did  in  our  earlier  work  [3],  but  then  we  would  be  neglecting 
many  rays  that  give  a  more  complete  picture  of  the  performance 
of  the  optical  system.  Simulations  could  be  regarded  as  most 
productive  when  p  is  close  to,  but  still  less  than,  the  value  one. 
In  comparing  two  ray-tracing  systems,  values  of  p  should  be 
equal  for  a  fair  comparison  of  any  particular  simulation. 

This  paper  has  made  it  clear  that  the  ray-tracing  problem 
is  very  amenable  to  parallel  processing.  Since  each  ray  is 
independent  of  all  others,  we  can  assign  any  arbitrary  subset  of 
rays  to  available  processors  for  them  to  trace.  An  administrative 
program  (executing  on  the  PC  in  our  implementation)  can  easily 
gather  the  results  and  assign  new  ray  bundles  to  large  numbers 
of  processors  without  falling  behind  or  unduly  delaying  the 
number  crunching  in  the  slave  processors. 
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